Last night on twitter there was a bit of a firestorm over this New York Times snippet about p-values (here is my favorite twitter-snark response). While the article has a surprising number of controversial sentences for only 180 words, the most offending sentence is:
By convention, a p-value higher than 0.05 usually indicates that the results of the study, however good or bad, were probably due only to chance.
One problem with this sentence is that it commits a statistical cardinal sin by stating the equivalent of “the null hypothesis is probably true.” The correct interpretation for a p-value greater than 0.05 is “we cannot reject the null hypothesis” which can mean many things (for example, we did not collect enough data).
Another problem I have with the sentence is that the phrase “however good or bad” is incredibly misleading — it’s like saying that even if you see a fantastically big result with low variance, you still might call it “due to chance.” The idea of a p-value is that it’s a way of defining good or bad. Even if there’s a “big” increase or decrease in an outcome, is it really meaningful if there’s bigger variance in that change? (No.)
I’d hate to be accused of being an armchair critic, so here is my attempt to convey the meaning/importance of p-values to a non-stat, non-math audience. I think the key to having a good discussion of this is to provide more philosophical background (and I think that’s why these discussion are always so difficult). Yes this is over three times the word count of the original NYTimes article, but fortunately the internet has no word limits.
(n.b. I don’t actually hate to be accused of being an armchair critic. Being an armchair critic is the best part of tweeting, by far.)
(Also fair warning — I am totally not opening the “should we even be using the p-value” can of worms.)
by Hilary Parker
In the US judicial system, a person on trial is “innocent until proven guilty.” Similarly, in a clinical trial for a new drug, the drug must be assumed ineffective until “proven” effective. And just like in the courts, the definition of “proven” is fuzzy. Jurors must believe “beyond a reasonable doubt” that someone is guilty to convict. In medicine, the p-value is one attempt at summarizing whether or not a drug has been shown to be effective “beyond a reasonable doubt.” Medicine is set up like the court system for good reason — we want to avoid claiming that ineffective drugs are effective, just like we want to avoid locking up innocent people.
To understand whether or not a drug has been shown to be effective “beyond a reasonable doubt” we must understand the purpose of a clinical trial. A clinical trial provides an understanding of the possible improvements that a drug can cause in two ways: (1) it shows the average improvement that the drug gives, and (2) it accounts for the variance in that average improvement (which is determined by the number of patients in the trial as well as the true variation in the improvements that the drug causes).
Why are both (1) and (2) necessary? Think about it this way — if I were to tell you that the price between two books was $10, would you think that’s a big price difference, or a small one? If it were the difference in price between two paperback books, it’d be a pretty big difference since paperback books are usually under $20. However if it were the difference in price between two hardcover books, it’d be a smaller difference in prices, since hardcover books vary more in price, in part because they are more expensive. And if it were the difference in price between two ebook versions of printed books, that’d be a huge difference, since ebook prices (for printed books) are quite stable at the $9.99 mark.
We intuitively understand what a big difference is for different types of books because we’ve contextualized them — we’ve been looking at book prices for years, and understand the variation in the prices. With the effectiveness of a new drug in a clinical trial, however, we don’t have that context. Instead of looking at the price of a single book, we’re looking at the average improvement that a drug causes — but either way, the number is meaningless without knowing the variance. Therefore, we have to determine (1) the average improvement, and (2) the variance of the average improvement, and use both of these quantities to determine whether the drug causes a “good” enough average improvement in the trial to call the drug effective. The p-value is simply a way of summarizing this conclusion quickly.
So, to put things back in terms of the p-value: let’s say that someone reports that NewDrug is shown to increase healthiness points by 10 points on average, with a p-value of 0.01. The p-value provides some context for whether or not an average increase of 10 points in this trial is “good” enough for the drug to be called effective, and is calculated by looking at both the average improvement, and the variance in improvements for different patients in the trial (while also controlling for the number of people in the trial). The correct way to interpret a p-value of 0.01 is: “If in reality NewDrug is NOT effective, then the probability of seeing an average increase in healthiness points at least this big if we repeated this trial is 1% (0.01*100).” The convention is that that’s a low enough probability to say that NewDrug has been shown to be effective “beyond a reasonable doubt” since it is less than 5%. (Many, many statisticians loathe this cut-off, but it is the standard for now.)
A key thing to understand about the p-value in a clinical trial is that a p-value greater than 0.05 doesn’t “prove” that NewDrug is ineffective — it just means that NewDrug wasn’t shown to be effective in the clinical trial. We can all think of examples of court cases where the person was probably guilty, but was not convicted (OJ Simpson, anyone?). Similarly with the p-value, if a trial reports a p-value greater than 0.05, it doesn’t “prove” that the drug is ineffective. It just means that researchers failed to show that the drug was effective “beyond a reasonable doubt” in the trial. Perhaps the drug really is ineffective, or perhaps the researchers simply did not gather enough samples to make a convincing case.
edit: In case you’re doubting my credentials regarding p-values, I’ll just leave this right here…